Skip to content

Add document deletion and update functionality#3

Merged
ezimuel merged 8 commits intoezimuel:mainfrom
danielebarbaro:feat/update-delete
Mar 31, 2026
Merged

Add document deletion and update functionality#3
ezimuel merged 8 commits intoezimuel:mainfrom
danielebarbaro:feat/update-delete

Conversation

@danielebarbaro
Copy link
Copy Markdown
Collaborator

  • Add soft-delete for HNSW (nodes kept for graph connectivity, excluded from results)
  • Add full removal from BM25 index via removeDocument()
  • Add deleteDocument() and updateDocument() to VectorDatabase
  • Persist deleted state in meta.json and restore on open()
  • Update count() to return active documents only
  • Add comprehensive tests for delete/update operations

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds first-class delete and update operations to VectorDatabase, including soft-delete support in the HNSW graph, hard removal from the BM25 index, and persistence of deleted state across saves/opens.

Changes:

  • Add deleteDocument() / updateDocument() to VectorDatabase, and make count() report only active (non-deleted) documents.
  • Implement soft-delete filtering + deleted-state import/export in HNSW\Index, and add BM25\Index::removeDocument() for full removal from text search.
  • Extend persistence and tests to cover delete/update behavior and restore deleted state on reopen.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/VectorDatabase.php Adds delete/update APIs, persists deleted set in meta.json, and updates count() to exclude deleted nodes.
src/HNSW/Index.php Introduces a deleted-node set, filters deleted nodes from search results, and exports/imports deleted IDs.
src/BM25/Index.php Adds removeDocument() to fully remove a document from the inverted index.
tests/VectorDatabaseTest.php Adds unit tests validating delete/update behavior in-memory across vector/text search and count.
tests/PersistenceTest.php Adds persistence tests for deleted-state restoration and update persistence.
README.md Documents the new delete/update APIs and their persistence semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/HNSW/Index.php Outdated
Comment on lines +367 to +371
// Filter out soft-deleted nodes and take the k nearest.
if (!empty($this->deleted)) {
$W = array_values(array_filter(
$W,
fn(array $pair) => !isset($this->deleted[$pair[1]])
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filtering soft-deleted nodes only after searchLayer() can make search() return fewer than $k results (even 0) when $ef is small (e.g. $ef === $k) and a top candidate is deleted, despite other active nodes existing. Consider constructing the final top-k by skipping deleted nodes and/or retrying with a larger ef until you have up to $k active results.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with commit a201eb1

Comment thread src/BM25/Index.php Outdated
Comment on lines +213 to +217
// Remove from inverted index.
foreach ($this->invertedIndex as $term => &$postings) {
unset($postings[$nodeId]);
// Remove empty posting lists to save memory.
if (empty($postings)) {
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removeDocument() scans the entire $invertedIndex vocabulary to remove a single nodeId, which is O(|V|) per delete and can become expensive as the index grows (delete/update are now public APIs). A more scalable approach is to track per-document term lists on insert so deletion only touches terms that were present in the removed document.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c79b9ca

Comment thread tests/PersistenceTest.php
Comment on lines +411 to +415
$results = $loaded->vectorSearch([1.0, 0.0], k: 3);
$ids = array_map(fn($r) => $r->document->id, $results);
self::assertNotContains(1, $ids);
self::assertContains(2, $ids);
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only checks that user-visible ID 1 isn’t present, but after open() the deleted node’s stub Document ID falls back to its nodeId (e.g. 0), so the test could still pass even if the deleted node is erroneously returned. Consider also asserting the result count is 2 (only two active docs remain) and/or explicitly asserting the deleted nodeId isn’t returned in $ids for this fixture.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a real issue

Comment thread src/VectorDatabase.php Outdated
Comment thread src/VectorDatabase.php Outdated
Comment thread src/VectorDatabase.php
Comment on lines +201 to +205
if ($this->path !== null) {
$docFile = $this->path . '/docs/' . $nodeId . '.bin';
if (file_exists($docFile)) {
unlink($docFile);
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With persistence enabled, doc files may still be in-flight from an async DocumentStore::write() (pcntl_fork). If a document is deleted before that write finishes, a late child write can recreate {nodeId}.bin after the delete, so the file may not actually stay deleted. Consider waiting for outstanding writes (or adding per-node cancel/atomic write semantics) before removing the file.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with commit 65b16da

ezimuel and others added 2 commits March 26, 2026 09:53
Add return value check on document delete

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Check for successfully unlink()

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/VectorDatabase.php Outdated
Comment on lines +209 to +213
// Suppress PHP warning and handle failure explicitly to keep
// on-disk state consistent with in-memory indexes.
if (!@unlink($docFile) && file_exists($docFile)) {
throw new \RuntimeException(
"Failed to delete persisted document file: {$docFile}"
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addDocument() can write doc files asynchronously (fork) via DocumentStore::write(), but deleteDocument() unlinks {nodeId}.bin without waiting for outstanding writes. A late child write can recreate the file after deletion, leaving on-disk state inconsistent. Consider waiting for pending writes (e.g., DocumentStore::waitAll()) or adding per-node cancel/atomic write semantics before unlinking.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with the commit 65b16da

Comment thread src/VectorDatabase.php Outdated
Comment on lines +205 to +216
// Delete document file from disk if persistence is enabled.
if ($this->path !== null) {
$docFile = $this->path . '/docs/' . $nodeId . '.bin';
if (file_exists($docFile)) {
// Suppress PHP warning and handle failure explicitly to keep
// on-disk state consistent with in-memory indexes.
if (!@unlink($docFile) && file_exists($docFile)) {
throw new \RuntimeException(
"Failed to delete persisted document file: {$docFile}"
);
}
}
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With persistence enabled, deleteDocument() removes docs/{nodeId}.bin immediately but the on-disk indexes (meta.json, hnsw.bin, bm25.bin) are only updated on save(). If the process crashes (or save() is not called), the folder can be left with indexes that still reference the document while its doc file is missing, causing later searches to fail when hydrating results. Consider making deletion crash-safe (e.g., defer physical file removal until save(), or persist a tombstone/updated meta atomically as part of deletion).

Suggested change
// Delete document file from disk if persistence is enabled.
if ($this->path !== null) {
$docFile = $this->path . '/docs/' . $nodeId . '.bin';
if (file_exists($docFile)) {
// Suppress PHP warning and handle failure explicitly to keep
// on-disk state consistent with in-memory indexes.
if (!@unlink($docFile) && file_exists($docFile)) {
throw new \RuntimeException(
"Failed to delete persisted document file: {$docFile}"
);
}
}
// NOTE:
// We intentionally do NOT delete the persisted document file here.
// The in-memory indexes are updated immediately, but on-disk index
// structures (meta.json, hnsw.bin, bm25.bin) are only updated on save().
// Deleting the file now could leave the persisted index referencing a
// missing document if the process crashes before save() is called.
// Any crash-safe physical cleanup of unused document files should be
// performed as part of a persistence step that also updates the indexes.
if ($this->path !== null) {
// Defer physical deletion to a persistence routine that can update
// both document files and indexes atomically.

Copilot uses AI. Check for mistakes.
Comment thread src/BM25/Index.php Outdated
Comment on lines +214 to +218
foreach ($this->invertedIndex as $term => &$postings) {
unset($postings[$nodeId]);
// Remove empty posting lists to save memory.
if (empty($postings)) {
unset($this->invertedIndex[$term]);
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removeDocument() removes a single doc by iterating over the entire $invertedIndex vocabulary and unsetting the nodeId from each postings list (O(|V|) per delete). Since delete/update are public APIs now, this can become a bottleneck on large indexes. Consider tracking per-document term lists on insert so deletions only touch terms that appeared in that document.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting. 🤔

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with commit c79b9ca

Copy link
Copy Markdown
Owner

@ezimuel ezimuel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I applied some fixes to the Copilot suggestions.

@ezimuel ezimuel merged commit 5ae150a into ezimuel:main Mar 31, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants